{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "punL79CN7Ox6"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "_ckMIh7O7s6D"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hAclqSm3OOml"
      },
      "source": [
        "# Using LSTMs with the subwords dataset\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S5Uhzt6vVIB2"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l10c01_nlp_lstms_with_reviews_subwords_dataset.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l10c01_nlp_lstms_with_reviews_subwords_dataset.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KTVx8__oGR9J"
      },
      "source": [
        "In this colab, you'll compare the results of using a model with an Embedding layer and then adding bidirectional LSTM layers.\n",
        "\n",
        "You'll work with the dataset of subwords for the combined Yelp and Amazon reviews.\n",
        "\n",
        "You'll use your models to predict the sentiment of new reviews."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "L62G7LTwNzoD"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "\n",
        "from tensorflow.keras.preprocessing.sequence import pad_sequences"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hLcl0QHvDjTV"
      },
      "source": [
        "# Get the dataset\n",
        "\n",
        "Start by getting the dataset containing Amazon and Yelp reviews, with their related sentiment (1 for positive, 0 for negative). This dataset was originally extracted from [here](https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nCOtiRJZbxCH"
      },
      "outputs": [],
      "source": [
        "!wget --no-check-certificate \\\n",
        "    https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P -O /tmp/sentiment.csv"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XuqER_KMD-xS"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "dataset = pd.read_csv('/tmp/sentiment.csv')\n",
        "\n",
        "# Extract out sentences and labels\n",
        "sentences = dataset['text'].tolist()\n",
        "labels = dataset['sentiment'].tolist()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Tbsx1T2CXPNO"
      },
      "outputs": [],
      "source": [
        "# Print some example sentences and labels\n",
        "for x in range(2):\n",
        "  print(sentences[x])\n",
        "  print(labels[x])\n",
        "  print(\"\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "33AthPiALFZK"
      },
      "source": [
        "#Create a subwords dataset\n",
        "\n",
        "We will use the Amazon and Yelp reviews dataset with tensorflow_datasets's SubwordTextEncoder functionality. \n",
        "\n",
        "SubwordTextEncoder.build_from_corpus() will create a tokenizer for us. You could also use this functionality to get subwords from a much larger corpus of text as well, but we'll just use our existing dataset here.\n",
        "\n",
        "We'll create a subword vocab_size of only the 1,000 most common subwords, as well as cutting off each subword to be at most 5 characters.\n",
        "\n",
        "Check out the related documentation for the the subword text encoder [here](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder#build_from_corpus)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6NaicNCcLYyf"
      },
      "outputs": [],
      "source": [
        "import tensorflow_datasets as tfds\n",
        "\n",
        "vocab_size = 1000\n",
        "tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(sentences, vocab_size, max_subword_length=5)\n",
        "\n",
        "# How big is the vocab size?\n",
        "print(\"Vocab size is \", tokenizer.vocab_size)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xvRVoeIVLevh"
      },
      "outputs": [],
      "source": [
        "# Check that the tokenizer works appropriately\n",
        "num = 5\n",
        "print(sentences[num])\n",
        "encoded = tokenizer.encode(sentences[num])\n",
        "print(encoded)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "G_vacTCifklV"
      },
      "outputs": [],
      "source": [
        "# Separately print out each subword, decoded\n",
        "for i in encoded:\n",
        "  print(tokenizer.decode([i]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cT528cptLupl"
      },
      "source": [
        "## Replace sentence data with encoded subwords\n",
        "\n",
        "Now, we'll create the sequences to be used for training by actually encoding each of the individual sentences. This is equivalent to `text_to_sequences` with the `Tokenizer` we used in earlier exercises."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lkseMhxjL09F"
      },
      "outputs": [],
      "source": [
        "for i, sentence in enumerate(sentences):\n",
        "  sentences[i] = tokenizer.encode(sentence)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "y21yRuzmL43U"
      },
      "outputs": [],
      "source": [
        "# Check the sentences are appropriately replaced\n",
        "print(sentences[5])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8HrcPHESMBMs"
      },
      "source": [
        "## Final pre-processing\n",
        "\n",
        "Before training, we still need to pad the sequences, as well as split into training and test sets."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "50-hTsogLSL-"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "max_length = 50\n",
        "trunc_type='post'\n",
        "padding_type='post'\n",
        "\n",
        "# Pad all sequences\n",
        "sequences_padded = pad_sequences(sentences, maxlen=max_length, \n",
        "                                 padding=padding_type, truncating=trunc_type)\n",
        "\n",
        "# Separate out the sentences and labels into training and test sets\n",
        "training_size = int(len(sentences) * 0.8)\n",
        "\n",
        "training_sequences = sequences_padded[0:training_size]\n",
        "testing_sequences = sequences_padded[training_size:]\n",
        "training_labels = labels[0:training_size]\n",
        "testing_labels = labels[training_size:]\n",
        "\n",
        "# Make labels into numpy arrays for use with the network later\n",
        "training_labels_final = np.array(training_labels)\n",
        "testing_labels_final = np.array(testing_labels)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PahZm7YEQ8EI"
      },
      "source": [
        "# Create the model using an Embedding"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c_nyQeI0RCCv"
      },
      "outputs": [],
      "source": [
        "embedding_dim = 16\n",
        "\n",
        "model = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.GlobalAveragePooling1D(), \n",
        "    tf.keras.layers.Dense(6, activation='relu'),\n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "\n",
        "model.summary()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3WRXrx8BRO2L"
      },
      "source": [
        "# Train the model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oBKyVYvxRQ_9"
      },
      "outputs": [],
      "source": [
        "num_epochs = 30\n",
        "model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
        "history = model.fit(training_sequences, training_labels_final, epochs=num_epochs, validation_data=(testing_sequences, testing_labels_final))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HhLPbUl2AZ0y"
      },
      "source": [
        "# Plot the accuracy and loss"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jzBM1PpJAYfD"
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n",
        "\n",
        "\n",
        "def plot_graphs(history, string):\n",
        "  plt.plot(history.history[string])\n",
        "  plt.plot(history.history['val_'+string])\n",
        "  plt.xlabel(\"Epochs\")\n",
        "  plt.ylabel(string)\n",
        "  plt.legend([string, 'val_'+string])\n",
        "  plt.show()\n",
        "  \n",
        "plot_graphs(history, \"accuracy\")\n",
        "plot_graphs(history, \"loss\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Fwr5inBiWffb"
      },
      "source": [
        "# Define a function to predict the sentiment of reviews\n",
        "\n",
        "We'll be creating models with some differences and will use each model to predict the sentiment of some new reviews.\n",
        "\n",
        "To save time, create a function that will take in a model and some new reviews, and print out the sentiment of each reviews.\n",
        "\n",
        "The higher the sentiment value is to 1, the more positive the review is."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aPNOYiiaha2y"
      },
      "outputs": [],
      "source": [
        "# Define a function to take a series of reviews\n",
        "# and predict whether each one is a positive or negative review\n",
        "\n",
        "# max_length = 100 # previously defined\n",
        "\n",
        "def predict_review(model, new_sentences, maxlen=max_length, show_padded_sequence=True ):\n",
        "  # Keep the original sentences so that we can keep using them later\n",
        "  # Create an array to hold the encoded sequences\n",
        "  new_sequences = []\n",
        "\n",
        "  # Convert the new reviews to sequences\n",
        "  for i, frvw in enumerate(new_sentences):\n",
        "    new_sequences.append(tokenizer.encode(frvw))\n",
        "\n",
        "  trunc_type='post' \n",
        "  padding_type='post'\n",
        "\n",
        "  # Pad all sequences for the new reviews\n",
        "  new_reviews_padded = pad_sequences(new_sequences, maxlen=max_length, \n",
        "                                 padding=padding_type, truncating=trunc_type)             \n",
        "\n",
        "  classes = model.predict(new_reviews_padded)\n",
        "\n",
        "  # The closer the class is to 1, the more positive the review is\n",
        "  for x in range(len(new_sentences)):\n",
        "    \n",
        "    # We can see the padded sequence if desired\n",
        "    # Print the sequence\n",
        "    if (show_padded_sequence):\n",
        "      print(new_reviews_padded[x])\n",
        "    # Print the review as text\n",
        "    print(new_sentences[x])\n",
        "    # Print its predicted class\n",
        "    print(classes[x])\n",
        "    print(\"\\n\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Qg-maex27KPW"
      },
      "outputs": [],
      "source": [
        "# Use the model to predict some reviews   \n",
        "fake_reviews = [\"I love this phone\", \n",
        "                \"Everything was cold\",\n",
        "                \"Everything was hot exactly as I wanted\", \n",
        "                \"Everything was green\", \n",
        "                \"the host seated us immediately\",\n",
        "                \"they gave us free chocolate cake\", \n",
        "                \"we couldn't hear each other talk because of the shouting in the kitchen\"\n",
        "              ]\n",
        "\n",
        "predict_review(model, fake_reviews)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ycJKbMq3K4iy"
      },
      "source": [
        "# Define a function to train and show the results of models with different layers\n",
        "\n",
        "In the rest of this colab, we will define models, and then see the results. \n",
        "\n",
        "Define a function that will take the model, compile it, train it, graph the accuracy and loss, and then predict some results."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PevUcINXK3gn"
      },
      "outputs": [],
      "source": [
        "def fit_model_now (model, sentences) :\n",
        "  model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
        "  model.summary()\n",
        "  history = model.fit(training_sequences, training_labels_final, epochs=num_epochs, \n",
        "                      validation_data=(testing_sequences, testing_labels_final))\n",
        "  return history\n",
        "\n",
        "def plot_results (history):\n",
        "  plot_graphs(history, \"accuracy\")\n",
        "  plot_graphs(history, \"loss\")\n",
        "\n",
        "def fit_model_and_show_results (model, sentences):\n",
        "  history = fit_model_now(model, sentences)\n",
        "  plot_results(history)\n",
        "  predict_review(model, sentences)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U13JBiJUG1oq"
      },
      "source": [
        "# Add a bidirectional LSTM\n",
        "\n",
        "Create a new model that uses a bidirectional LSTM.\n",
        "\n",
        "Then use the function we have already defined to compile the model, train it, graph the accuracy and loss, then predict some results."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "scTUsFPAG4zP"
      },
      "outputs": [],
      "source": [
        "# Define the model\n",
        "model_bidi_lstm = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)), \n",
        "    tf.keras.layers.Dense(6, activation='relu'), \n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "\n",
        "# Compile and train the model and then show the predictions for our extra sentences\n",
        "fit_model_and_show_results(model_bidi_lstm, fake_reviews)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QsxKPbCnPJTj"
      },
      "source": [
        "# Use multiple bidirectional layers\n",
        "\n",
        "Now let's see if we get any improvements from adding another Bidirectional LSTM layer to the model.\n",
        "\n",
        "Notice that the first Bidirectionl LSTM layer returns a sequence."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3N6Zul47PMED"
      },
      "outputs": [],
      "source": [
        "model_multiple_bidi_lstm = tf.keras.Sequential([\n",
        "    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
        "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, \n",
        "                                                       return_sequences=True)), \n",
        "    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),\n",
        "    tf.keras.layers.Dense(6, activation='relu'),\n",
        "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
        "])\n",
        "\n",
        "fit_model_and_show_results(model_multiple_bidi_lstm, fake_reviews)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ABVYYPwba8Hx"
      },
      "source": [
        "# Compare predictions for all the models\n",
        "\n",
        "It can be hard to see which model gives a better prediction for different reviews when you examine each model separately. So for comparison purposes, here we define some more reviews and print out the predictions that each of the three models gives for each review:\n",
        "\n",
        "*   Embeddings and a Global Average Pooling layer\n",
        "*   Embeddings and a Bidirectional LSTM layer\n",
        "*   Embeddings and two Bidirectional LSTM layers\n",
        "\n",
        "The results are not always what you might expect. The input dataset is fairly small, it has less than 2000 reviews. Some of the reviews are fairly short, and some of the short ones are fairly repetitive which reduces their impact on improving the  model, such as these two reviews:\n",
        "\n",
        "*   Bad Quality.\n",
        "*   Low Quality.\n",
        "\n",
        "Feel free to add more reviews of your own, or change the reviews. The results will depend on the combination of words in the reviews, and how well they match to reviews in the training set. \n",
        "\n",
        "How do the different models handle things like \"wasn't good\" which contains a positive word (good) but is a poor review?\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6XebrXt0jtOy"
      },
      "outputs": [],
      "source": [
        "my_reviews =[\"lovely\", \"dreadful\", \"stay away\",\n",
        "             \"everything was hot exactly as I wanted\",\n",
        "             \"everything was not exactly as I wanted\",\n",
        "             \"they gave us free chocolate cake\",\n",
        "             \"I've never eaten anything so spicy in my life, my throat burned for hours\",\n",
        "             \"for a phone that is as expensive as this one I expect it to be much easier to use than this thing is\",\n",
        "             \"we left there very full for a low price so I'd say you just can't go wrong at this place\",\n",
        "             \"that place does not have quality meals and it isn't a good place to go for dinner\",\n",
        "             ]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tRWGjkJLkY2y"
      },
      "outputs": [],
      "source": [
        "print(\"===================================\\n\",\"Embeddings only:\\n\", \"===================================\",)\n",
        "predict_review(model, my_reviews, show_padded_sequence=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "G2FJR3IVBt30"
      },
      "outputs": [],
      "source": [
        "print(\"===================================\\n\", \"With a single bidirectional LSTM:\\n\", \"===================================\")\n",
        "predict_review(model_bidi_lstm, my_reviews, show_padded_sequence=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "81v1r3Y2BwvC"
      },
      "outputs": [],
      "source": [
        "print(\"===================================\\n\",\"With two bidirectional LSTMs:\\n\", \"===================================\")\n",
        "predict_review(model_multiple_bidi_lstm, my_reviews, show_padded_sequence=False)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "l10c01_nlp_lstms_with_reviews_subwords_dataset.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
